Content-dependent chunking for differential compression, the local maximum approach

Authors

  • Nikolaj Bjørner
  • Andreas Blass
  • Yuri Gurevich
Abstract

When a file is to be transmitted from a sender to a recipient and when the latter already has a file somewhat similar to it, remote differential compression seeks to determine the similarities interactively so as to transmit only the part of the new file not already in the recipient’s old file. Content-dependent chunking means that the sender and recipient chop their files into chunks, with the cutpoints determined by some internal features of the files, so that when segments of the two files agree (possibly in different locations within the files) the cutpoints in such segments tend to be in corresponding locations, and so the chunks agree. By exchanging hash values of the chunks, the sender and recipient can determine which chunks of the new file are absent from the old one and thus need to be transmitted. We propose two new algorithms for content-dependent chunking, and we compare their behavior, on random files, with each other and with previously used algorithms. One of our algorithms, the local maximum chunking method, has been implemented and found to work better in practice than previously used algorithms. Theoretical comparisons between the various algorithms can be based on several criteria, most of which seek to formalize the idea that chunks should be neither too small (so that hashing and sending hash values become inefficient) nor too large (so that agreements of entire chunks become unlikely). We propose a new criterion, called the slack of a chunking method, which seeks to measure how much of an interval of agreement between two files is wasted because it lies in chunks that don’t agree. Finally, we show how to efficiently find the cutpoints for local maximum chunking.
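
The chunk-and-compare scheme described in the abstract can be illustrated with a small sketch. The abstract does not define the local maximum rule, so the code assumes one plausible reading: each position gets a value from a hash of a short window, and a position is a cutpoint when its value is strictly greater than the values at the h positions on either side. The names `position_values`, `local_max_cutpoints`, `chunks`, and `chunks_to_send`, the parameters `h` and `w`, and the choice of BLAKE2b and SHA-256 are all hypothetical and chosen for illustration. The scan is a brute-force O(n·h) loop, not the efficient cutpoint algorithm the paper announces.

```python
import hashlib


def position_values(data: bytes, w: int = 4) -> list[int]:
    """Assumed stand-in for a rolling hash: one integer per window of w bytes."""
    return [
        int.from_bytes(hashlib.blake2b(data[i:i + w], digest_size=4).digest(), "big")
        for i in range(len(data) - w + 1)
    ]


def local_max_cutpoints(data: bytes, h: int = 16, w: int = 4) -> list[int]:
    """Brute-force scan: i is a cutpoint if its value strictly exceeds the
    values at the h positions before it and the h positions after it.
    Positions without h full neighbours on both sides are skipped (assumption)."""
    vals = position_values(data, w)
    cuts = []
    for i in range(h, len(vals) - h):
        neighbours = vals[i - h:i] + vals[i + 1:i + h + 1]
        if all(vals[i] > v for v in neighbours):
            cuts.append(i)
    return cuts


def chunks(data: bytes, h: int = 16, w: int = 4) -> list[bytes]:
    """Split data at its cutpoints; the pieces concatenate back to data."""
    bounds = [0] + local_max_cutpoints(data, h, w) + [len(data)]
    return [data[a:b] for a, b in zip(bounds, bounds[1:])]


def chunks_to_send(old: bytes, new: bytes, h: int = 16, w: int = 4) -> list[bytes]:
    """Chunks of new whose hash is absent from the hashes of old's chunks."""
    def digest(chunk: bytes) -> bytes:
        return hashlib.sha256(chunk).digest()

    known = {digest(c) for c in chunks(old, h, w)}
    return [c for c in chunks(new, h, w) if digest(c) not in known]
```

Because each cutpoint decision depends only on nearby bytes, inserting data at the front of a file shifts the later cutpoints without changing the chunks between them, so `chunks_to_send(old, prefix + old)` should return only chunks near the insertion. In a real exchange only the hash values would cross the network before the missing chunks are sent; here both files are local for simplicity.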

Similar resources

The regional estimates of the GPS satellite and receiver differential code biases

The Differential Code Biases (DCB), which are also termed hardware delay biases, are the frequency-dependent time delays of the satellite and receiver. Possible sources of these delays are antennas and cables, as well as different filters used in receivers and satellites. These instrumental delays affect both code and carrier measurements. These biases for satellites and some IGS stations tend ...

Ddelta: A deduplication-inspired fast delta compression approach

Delta compression is an efficient data reduction approach to removing redundancy among similar data chunks and files in storage systems. One of the main challenges facing delta compression is its low encoding speed, a worsening problem in face of the steadily increasing storage and network bandwidth and speed. In this paper, we present Ddelta, a deduplication-inspired fast delta compression sch...

Chinese Chunking Based on Maximum Entropy Markov Models

This paper presents a new Chinese chunking method based on maximum entropy Markov models. We firstly present two types of Chinese chunking specifications and data sets, based on which the chunking models are applied. Then we describe the hidden Markov chunking model and maximum entropy chunking model. Based on our analysis of the two models, we propose a maximum entropy Markov chunking model th...

The analytical solutions for Volterra integro-differential equations within Local fractional operators by Yang-Laplace transform

In this paper, we apply the local fractional Laplace transform method (or Yang-Laplace transform) on Volterra integro-differential equations of the second kind within the local fractional integral operators to obtain the analytical approximate solutions. The iteration procedure is based on local fractional derivative operators. This approach provides us with a convenient way to find a solution ...

Discourse Chunking and its Application to Sentence Compression

In this paper we consider the problem of analysing sentence-level discourse structure. We introduce discourse chunking (i.e., the identification of intra-sentential nucleus and satellite spans) as an alternative to full-scale discourse parsing. Our experiments show that the proposed modelling approach yields results comparable to state-of-the-art while exploiting knowledge-lean features and sma...

Journal:
  • J. Comput. Syst. Sci.

Volume 76  Issue

Pages  -

Publication date 2010